Deriving Private Information from Randomized Data

1
Deriving Private Information from Randomized Data
  • Zhengli Huang, Wenliang Du, and Biao Chen
  • SIGMOD 2005

2
Outline
  • Motivation
  • Reconstruction Algorithms
  • Improving Random Perturbation
  • Conclusion

3
Privacy-Preserving Data Mining
  • [Diagram] Data disguising → data collection →
    central database → data mining
  • Data mining tasks: classification, association
    rules, clustering

4
Random Perturbation
  • Disguised data Y = original data X + random noise R

5
Problem
  • How secure is random perturbation?

6
Example
  • This is the data set: (x, x, x, ..., x)
  • Random perturbation:
  • (x + r_1, x + r_2, ..., x + r_m)
  • Each r_i is a random number with mean 0.
  • So the mean of the perturbed numbers converges
    to x as m becomes large.
  • We know this is NOT safe.
  • Observation: the data set is highly correlated.
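
The averaging argument in this example can be checked numerically. The following is a minimal sketch, assuming Gaussian noise and a hypothetical private value x = 50; the slides only require the noise to have mean 0, and the helper name perturb is illustrative.

```python
import random
import statistics

def perturb(x, m, noise_std, rng):
    """Return m disguised copies x + r_i, each r_i zero-mean Gaussian noise."""
    return [x + rng.gauss(0.0, noise_std) for _ in range(m)]

rng = random.Random(0)  # fixed seed so the run is repeatable
x = 50.0                # hypothetical private value (assumption)

for m in (10, 1000, 100000):
    disguised = perturb(x, m, noise_std=20.0, rng=rng)
    estimate = statistics.fmean(disguised)
    # The gap between the sample mean and x should shrink as m grows.
    print(m, round(abs(estimate - x), 3))
```

Even with noise much larger than the typical gap between values, the sample mean should land close to x once m is large, which is why perturbing a perfectly correlated data set offers little protection.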

7
Summary
  • We hypothesize that the relationships among data
    attributes are a key factor in deciding how much
    privacy can be preserved.
  • We introduce two methods that exploit the
    correlations among data to reconstruct the
    original data from a randomized data set.

8
Data Reconstruction (DR)
  • [Diagram] Inputs: the disguised data Y and the
    distribution of the random noise
  • Output: the reconstructed data (an estimate of X)
  • What is the difference between the reconstructed
    data and the original data X?

9
Reconstruction Algorithms
  • Principal Component Analysis (PCA)
  • Bayes Estimate Method (BE)

10
PCA-Based Reconstruction
  • [Diagram] Disguised information → compression →
    reconstructed information
  • The compression step causes information loss.

11
PCA Introduction
  • The main use of PCA is to reduce the
    dimensionality of a data set with interrelated
    variables while retaining as much of the data
    set's variance as possible.
  • PCA transforms the data set into a new data set
    whose variables are uncorrelated.

12
PCA Introduction
  • First PC: contains the greatest amount of
    variation.
  • Second PC: contains the next largest amount of
    variation.
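
To see how the variation splits between the first and second PC, here is a small sketch for two attributes. It uses the closed-form eigenvalues of a 2x2 covariance matrix; the data set and the helper names covariance_2d and eigenvalues_2d are illustrative assumptions, not part of the presentation.

```python
import math
import random

def covariance_2d(points):
    """Sample covariance entries (var_a, cov_ab, var_b) of 2-D points."""
    n = len(points)
    mean_a = sum(p[0] for p in points) / n
    mean_b = sum(p[1] for p in points) / n
    var_a = sum((p[0] - mean_a) ** 2 for p in points) / (n - 1)
    var_b = sum((p[1] - mean_b) ** 2 for p in points) / (n - 1)
    cov_ab = sum((p[0] - mean_a) * (p[1] - mean_b) for p in points) / (n - 1)
    return var_a, cov_ab, var_b

def eigenvalues_2d(var_a, cov_ab, var_b):
    """Eigenvalues of [[var_a, cov_ab], [cov_ab, var_b]], largest first.

    Each eigenvalue is the variance along one principal component.
    """
    half_trace = (var_a + var_b) / 2.0
    spread = math.hypot((var_a - var_b) / 2.0, cov_ab)
    return half_trace + spread, half_trace - spread

rng = random.Random(0)
# Hypothetical data: the second attribute is the first plus a small deviation,
# so the two attributes are strongly correlated.
points = []
for _ in range(2000):
    a = rng.gauss(0.0, 10.0)
    points.append((a, a + rng.gauss(0.0, 1.0)))

first, second = eigenvalues_2d(*covariance_2d(points))
print("share of variance on first PC:", round(first / (first + second), 3))
```

For data this strongly correlated, nearly all of the variance should fall on the first PC, which is the property the reconstruction attack relies on.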

13
For the Original Data
  • The attributes are correlated.
  • If we remove 50% of the dimensions, the actual
    information loss might be less than 10%.
  • If the remaining m - p non-principal components
    are removed, little information is lost.

14
For the Random Noise
  • The noise components are not correlated.
  • Their variance is evenly distributed in all
    directions.
  • So removing dimensions loses noise in proportion
    to the number of dimensions removed.

15
Summary
  • If the data is highly correlated, then more
    dimensions can be removed without causing too
    much information loss for the original data.
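
A minimal sketch of PCA-based reconstruction for two attributes (m = 2, keeping p = 1 principal component). The data, the noise level, and the helper names first_pc, pca_reconstruct, and rmse are illustrative assumptions and not the authors' implementation.

```python
import math
import random

def first_pc(points):
    """Mean and unit-length first principal component of 2-D points.

    Assumes the two attributes have non-zero sample covariance.
    """
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    var_a = sum((p[0] - mean[0]) ** 2 for p in points) / (n - 1)
    var_b = sum((p[1] - mean[1]) ** 2 for p in points) / (n - 1)
    cov = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points) / (n - 1)
    # Largest eigenvalue of [[var_a, cov], [cov, var_b]] and its eigenvector.
    top = (var_a + var_b) / 2.0 + math.hypot((var_a - var_b) / 2.0, cov)
    vec = (cov, top - var_a)
    norm = math.hypot(*vec)
    return mean, (vec[0] / norm, vec[1] / norm)

def pca_reconstruct(disguised):
    """Keep only the first-PC coordinate of each disguised record."""
    mean, pc = first_pc(disguised)
    result = []
    for y in disguised:
        score = (y[0] - mean[0]) * pc[0] + (y[1] - mean[1]) * pc[1]
        result.append((mean[0] + score * pc[0], mean[1] + score * pc[1]))
    return result

def rmse(a, b):
    """Root-mean-square error over all coordinates of two point lists."""
    total = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / (2 * len(a)))

rng = random.Random(0)
# Hypothetical, strongly correlated original data X.
original = []
for _ in range(2000):
    a = rng.gauss(0.0, 10.0)
    original.append((a, a + rng.gauss(0.0, 1.0)))
# Disguised data Y = X + R with independent zero-mean noise on each attribute.
disguised = [(x[0] + rng.gauss(0.0, 5.0), x[1] + rng.gauss(0.0, 5.0))
             for x in original]

reconstructed = pca_reconstruct(disguised)
print("error of disguised data:    ", round(rmse(original, disguised), 2))
print("error of reconstructed data:", round(rmse(original, reconstructed), 2))
```

Projecting onto the first PC keeps the noise along that one direction and discards the rest, so on strongly correlated data the reconstructed records should sit closer to the originals than the disguised records do.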

16
Bayes-Estimate-Based Reconstruction
  • [Diagram] Several possible values of X, each
    combined with random noise, could have produced
    the disguised data Y.
  • What is the most likely X?

17
The Problem Formulation
  • For each possible X, there is a probability
    P(X | Y).
  • Find an X such that P(X | Y) is maximized.
  • How to compute P(X | Y)?

18
Computing P(X | Y)
  • P(X | Y) = P(Y | X) P(X) / P(Y)
  • P(Y | X): remember that Y = X + R, so P(Y | X) is
    the noise density evaluated at Y - X
  • P(Y): a constant (we don't care)
  • How to get P(X)?
  • This is where the correlation can be used.
  • Assume a multivariate Gaussian distribution.
  • The parameters are unknown.
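
Under two additional assumptions, the maximization has a closed form. The derivation below is a sketch: it assumes the noise R is Gaussian with mean 0 and covariance \sigma^2 I (the slides only state that the noise has mean 0), and that X is multivariate Gaussian with mean vector \mu and covariance matrix \Sigma.

```latex
% Assumptions (not stated on the slides): R ~ N(0, \sigma^2 I), X ~ N(\mu, \Sigma).
\begin{align*}
\hat{X} &= \arg\max_X P(X \mid Y) = \arg\max_X P(Y \mid X)\, P(X) \\
P(Y \mid X) &\propto \exp\!\Big(-\tfrac{1}{2\sigma^2} (Y - X)^{\top} (Y - X)\Big) \\
P(X) &\propto \exp\!\Big(-\tfrac{1}{2} (X - \mu)^{\top} \Sigma^{-1} (X - \mu)\Big) \\
0 &= \tfrac{1}{\sigma^2} (Y - \hat{X}) - \Sigma^{-1} (\hat{X} - \mu) \\
\hat{X} &= \Big(\Sigma^{-1} + \tfrac{1}{\sigma^2} I\Big)^{-1}
           \Big(\Sigma^{-1} \mu + \tfrac{1}{\sigma^2} Y\Big)
\end{align*}
```

The fourth line sets the gradient of the log posterior to zero. P(Y) drops out because it does not depend on X.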

19
Multivariate Gaussian Distribution
  • A multivariate Gaussian distribution:
  • Each variable follows a Gaussian distribution
    with mean μ_i
  • Mean vector μ = (μ_1, ..., μ_m)
  • Covariance matrix Σ
  • Both μ and Σ can be estimated from Y
  • So we can get P(X)
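
A sketch of how the estimate could be computed for two attributes. It assumes independent Gaussian noise with known variance noise_var on each attribute, estimates μ as the mean of Y and Σ as Cov(Y) minus noise_var on the diagonal, and then applies the Gaussian posterior maximizer in the form μ + Σ (Σ + noise_var I)^-1 (Y - μ). The data and the helper names mean_and_cov, bayes_estimate, and rmse are hypothetical.

```python
import math
import random

def mean_and_cov(points):
    """Sample mean and covariance entries (c00, c01, c11) of 2-D points."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    c00 = sum((p[0] - m0) ** 2 for p in points) / (n - 1)
    c11 = sum((p[1] - m1) ** 2 for p in points) / (n - 1)
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / (n - 1)
    return (m0, m1), (c00, c01, c11)

def bayes_estimate(disguised, noise_var):
    """Estimate each X as mu + S (S + noise_var * I)^-1 (Y - mu)."""
    (m0, m1), (c00, c01, c11) = mean_and_cov(disguised)
    # The noise has mean 0, so the mean of Y estimates mu. Subtracting the
    # noise variance from the diagonal of Cov(Y) estimates S = Cov(X).
    s00, s01, s11 = c00 - noise_var, c01, c11 - noise_var
    # Inverse of Cov(Y), which estimates S + noise_var * I.
    det = c00 * c11 - c01 * c01
    i00, i01, i11 = c11 / det, -c01 / det, c00 / det
    # Gain matrix G = S * Cov(Y)^-1, written out for the 2x2 case.
    g00 = s00 * i00 + s01 * i01
    g01 = s00 * i01 + s01 * i11
    g10 = s01 * i00 + s11 * i01
    g11 = s01 * i01 + s11 * i11
    estimates = []
    for y0, y1 in disguised:
        d0, d1 = y0 - m0, y1 - m1
        estimates.append((m0 + g00 * d0 + g01 * d1, m1 + g10 * d0 + g11 * d1))
    return estimates

def rmse(a, b):
    """Root-mean-square error over all coordinates of two point lists."""
    total = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / (2 * len(a)))

rng = random.Random(0)
# Hypothetical, strongly correlated original data X.
original = []
for _ in range(2000):
    a = rng.gauss(0.0, 10.0)
    original.append((a, a + rng.gauss(0.0, 1.0)))
noise_std = 5.0
disguised = [(x[0] + rng.gauss(0.0, noise_std), x[1] + rng.gauss(0.0, noise_std))
             for x in original]

estimated = bayes_estimate(disguised, noise_var=noise_std ** 2)
print("error of disguised data:", round(rmse(original, disguised), 2))
print("error of Bayes estimate:", round(rmse(original, estimated), 2))
```

Only the disguised data and the noise variance are used as inputs here; the original data appears solely to measure the error afterwards.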

20
Improved Randomization Scheme
  • From the analysis of PCA:
  • Most of the information in the original data
    concentrates on the PCs, so discarding the
    non-PCs removes more of the noise than of the
    original data.
  • However, if the random noise also concentrates on
    the PCs, separating the original data from the
    random noise becomes difficult.

21
Observation from PCA
  • How can we make it difficult to squeeze out the
    noise?
  • Make the correlation of the noise similar to
    that of the original data.
  • The noise then concentrates on the principal
    components.
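
One way to produce such noise for two attributes is to draw it from a zero-mean Gaussian whose covariance is a scaled copy of the original data's covariance. The sketch below does this with a hand-written 2x2 Cholesky factor. The scale factor, the data, and the helper names covariance, correlation, and correlated_noise are assumptions for illustration and may differ from the scheme in the paper.

```python
import math
import random

def covariance(points):
    """Sample covariance entries (c00, c01, c11) of 2-D points."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    c00 = sum((p[0] - m0) ** 2 for p in points) / (n - 1)
    c11 = sum((p[1] - m1) ** 2 for p in points) / (n - 1)
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / (n - 1)
    return c00, c01, c11

def correlation(points):
    """Pearson correlation between the two attributes."""
    c00, c01, c11 = covariance(points)
    return c01 / math.sqrt(c00 * c11)

def correlated_noise(original, scale, rng):
    """Zero-mean Gaussian noise with covariance scale * Cov(original)."""
    c00, c01, c11 = covariance(original)
    # Cholesky factor L of the 2x2 covariance: L * L^T = Cov(original).
    l00 = math.sqrt(c00)
    l10 = c01 / l00
    l11 = math.sqrt(c11 - l10 * l10)
    root = math.sqrt(scale)
    noise = []
    for _ in original:
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        noise.append((root * l00 * z0, root * (l10 * z0 + l11 * z1)))
    return noise

rng = random.Random(0)
# Hypothetical, strongly correlated original data X.
original = []
for _ in range(2000):
    a = rng.gauss(0.0, 10.0)
    original.append((a, a + rng.gauss(0.0, 1.0)))

independent = [(rng.gauss(0.0, 5.0), rng.gauss(0.0, 5.0)) for _ in original]
correlated = correlated_noise(original, scale=0.25, rng=rng)
# Disguised data under the improved scheme: Y = X + correlated noise.
disguised = [(x[0] + r[0], x[1] + r[1]) for x, r in zip(original, correlated)]

print("correlation of original data:    ", round(correlation(original), 3))
print("correlation of independent noise:", round(correlation(independent), 3))
print("correlation of correlated noise: ", round(correlation(correlated), 3))
```

Because the correlated noise shares the data's principal directions, discarding the non-PCs should remove little of it, which is the effect this slide is after.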

22
Conclusion
  • Can other information affect privacy?