Title: Privacy-Preserving Data Mining
1. Privacy-Preserving Data Mining
- Presenter: Li Cao
- October 15, 2003
2. Presentation organization
- Associate Data Mining with Privacy
- Privacy-Preserving Data Mining scheme using random perturbation
- Privacy-Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
3. Privacy protection history / Privacy concerns nowadays
- Citizens' attitude
- Scholars' attitude
4. Internet users' attitudes
5. Privacy value
- Filtering to weed out unwanted information
- Better search results with less effort
- Useful recommendations
- Market trends
- Example: from the analysis of a large number of purchase transaction records together with the customers' age and income, we can learn which kinds of customers prefer a particular style or brand.
6. Motivation (Introducing Data Mining)
- Data Mining's goal: discover knowledge, trends, and patterns from large amounts of data.
- Data Mining's primary task: develop accurate models about aggregated data without access to precise information in individual data records (not only discovering knowledge but also preserving privacy).
7. Presentation organization
- Associate Data Mining with Privacy
- Privacy-Preserving Data Mining scheme using random perturbation
- Privacy-Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
8. Privacy-Preserving Data Mining scheme using random perturbation
- Basic idea
- Reconstruction procedure
- Decision-Tree classification
- Three different algorithms
9. Attribute list of an example
10. Records of an example
11. Privacy-preserving methods
- Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
- Value Distortion: return x_i + r instead of x_i, where r is a random value drawn from (a sketch of both variants follows below)
- a) a Uniform distribution
- b) a Gaussian distribution
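A minimal sketch of the two value-distortion variants, assuming numeric attribute values and NumPy; the function names and noise parameters here are illustrative, not from the original slides:

```python
import numpy as np

def distort_uniform(values, alpha):
    """Return x_i + r with r drawn uniformly from [-alpha, +alpha]."""
    values = np.asarray(values, dtype=float)
    r = np.random.uniform(-alpha, alpha, size=values.shape)
    return values + r

def distort_gaussian(values, sigma):
    """Return x_i + r with r drawn from a Gaussian with mean 0 and std sigma."""
    values = np.asarray(values, dtype=float)
    r = np.random.normal(0.0, sigma, size=values.shape)
    return values + r

# Example: perturb a column of ages before it leaves the client.
ages = [23, 35, 47, 52, 61]
print(distort_uniform(ages, alpha=10))
print(distort_gaussian(ages, sigma=5))
```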
12. Basic idea
- Original data → Perturbed data (let users provide a modified value for sensitive attributes)
- Estimate the distribution of the original data from the perturbed data
- Build classifiers using these reconstructed distributions (decision tree)
13. Basic Steps
14. Reconstruction problem
- View the n original data values x1, x2, ..., xn of a one-dimensional distribution as realizations of n independent identically distributed (iid) random variables X1, X2, ..., Xn, each with the same distribution as the random variable X.
- To hide these data values, n independent random variables Y1, Y2, ..., Yn have been used, each with the same distribution as a different random variable Y.
15.
- Given x1 + y1, x2 + y2, ..., xn + yn (where yi is the realization of Yi) and the cumulative distribution function FY for Y, we would like to estimate the cumulative distribution function FX for X.
- In short: given a cumulative distribution FY and the realizations of n iid random samples X1 + Y1, X2 + Y2, ..., Xn + Yn, estimate FX.
16. Reconstruction process
- Let the value of Xi + Yi be wi (= xi + yi). Use Bayes' rule to estimate the posterior distribution function F'X1 (given that X1 + Y1 = w1) for X1, assuming we know the density functions fX and fY for X and Y respectively. (The formula is written out below.)
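The equation itself did not survive in the slide text; in the notation above, the Bayes-rule posterior for X1 takes the following form (my reconstruction, to be checked against the underlying paper):

```latex
F'_{X_1}(a) \;=\;
\frac{\displaystyle\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}
     {\displaystyle\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}
```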
17.
- To estimate the posterior distribution function F'X given x1 + y1, x2 + y2, ..., xn + yn, we average the distribution functions for each of the Xi.
18.
- The corresponding posterior density function f'X is obtained by differentiating F'X.
- Given a sufficiently large number of samples, f'X will be very close to the real density function fX. (Both steps are written out below.)
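Written out in the same notation, the averaging step and the differentiated density are (again my reconstruction):

```latex
F'_X(a) \;=\; \frac{1}{n}\sum_{i=1}^{n} F'_{X_i}(a),
\qquad
f'_X(a) \;=\; \frac{1}{n}\sum_{i=1}^{n}
      \frac{f_Y(w_i - a)\, f_X(a)}
           {\displaystyle\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
```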
19. Reconstruction algorithm (a runnable Python sketch follows below)
- f'X^0 := Uniform distribution
- j := 0  // iteration number
- repeat
-     f'X^(j+1)(a) := (1/n) * Σ_i [ fY(wi - a) * f'X^j(a) / ∫ fY(wi - z) * f'X^j(z) dz ]   // update step: the averaged posterior density, using the current estimate
-     j := j + 1
- until (stopping criterion met)
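A minimal NumPy sketch of this iterative reconstruction, assuming the noise density fY is known and the estimate is kept on a discrete grid; the function and variable names are mine, and the stopping rule is the "successive estimates are very close" criterion from the next slide:

```python
import numpy as np

def reconstruct_distribution(w, f_y, grid, n_iter=50, tol=1e-6):
    """Iteratively estimate the density of X from perturbed values w_i = x_i + y_i.

    w    : 1-D array of observed (perturbed) values
    f_y  : vectorized function giving the known noise density f_Y
    grid : 1-D array of points a at which the estimate f_X is evaluated
    Returns the estimated density of X over `grid`.
    """
    w = np.asarray(w, dtype=float)
    grid = np.asarray(grid, dtype=float)
    step = grid[1] - grid[0]

    # f_X^0 := uniform distribution over the grid
    f_x = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))

    # Pre-compute f_Y(w_i - a) for every observation i and grid point a.
    noise = f_y(w[:, None] - grid[None, :])            # shape (n, len(grid))

    for _ in range(n_iter):
        denom = (noise * f_x).sum(axis=1) * step        # approximates the integral over z
        denom = np.where(denom > 0, denom, 1e-12)       # guard against division by zero
        f_new = (noise * f_x / denom[:, None]).mean(axis=0)
        f_new /= f_new.sum() * step                     # renormalize to integrate to 1
        if np.abs(f_new - f_x).sum() * step < tol:      # successive estimates very close
            f_x = f_new
            break
        f_x = f_new
    return f_x

# Example: ages perturbed with Gaussian noise of standard deviation 10.
rng = np.random.default_rng(0)
true_x = rng.normal(40, 8, size=2000)
w = true_x + rng.normal(0, 10, size=true_x.shape)
gauss = lambda t: np.exp(-t**2 / (2 * 10**2)) / (10 * np.sqrt(2 * np.pi))
grid = np.linspace(0, 80, 161)
est = reconstruct_distribution(w, gauss, grid)
```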
20. Stopping criterion
- Observed randomized distribution ≈ the result of randomizing the current estimate of the original distribution
- The difference between successive estimates of the original distribution is very small.
21. Reconstruction effect
22. (figure only)
23. Decision-Tree Classification
- Two stages: (1) Growth, (2) Prune
- Example
24. Tree-growth phase algorithm (a Python sketch follows below)
- Partition(Data S)
- begin
-     if (most points in S are of the same class) then return;
-     for each attribute A do
-         evaluate splits on attribute A;
-     use the best split to partition S into S1 and S2;
-     Partition(S1);
-     Partition(S2);
- end
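A compact Python rendering of the growth phase above. `find_best_split` is a hypothetical helper (it could, for example, rank candidate splits by the Gini index from slide 26), and the "most points are of the same class" test is approximated with a simple majority threshold:

```python
from collections import Counter

def partition(records, labels, find_best_split, purity=0.95, min_size=5):
    """Tree-growth phase: recursively split S until most points share one class.

    find_best_split(records, labels) is a hypothetical helper that returns
    (attribute, threshold, left_idx, right_idx), or None if no useful split exists.
    Returns either a leaf label or a nested (attribute, threshold, left, right) tuple.
    """
    majority, majority_count = Counter(labels).most_common(1)[0]
    # Stop when most points in S are of the same class (or the node is too small).
    if majority_count >= purity * len(labels) or len(labels) <= min_size:
        return majority
    split = find_best_split(records, labels)           # evaluate splits on each attribute
    if split is None:
        return majority
    attribute, threshold, left_idx, right_idx = split
    if not left_idx or not right_idx:                  # degenerate split: stop here
        return majority
    left = partition([records[i] for i in left_idx],
                     [labels[i] for i in left_idx], find_best_split, purity, min_size)
    right = partition([records[i] for i in right_idx],
                      [labels[i] for i in right_idx], find_best_split, purity, min_size)
    return (attribute, threshold, left, right)
```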
25. Choose the best split
- Information gain (categorical attributes)
- Gini index (continuous attributes)
26. Gini index calculation (a small Python version follows below)
- gini(S) = 1 - Σ_j p_j²   (p_j is the relative frequency of class j in S)
- If a split divides S into two subsets S1 and S2:
- gini_split(S) = (|S1| / |S|) * gini(S1) + (|S2| / |S|) * gini(S2)
- Note: calculating this index requires only the distribution of the class values.
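A small Python version of the two formulas above, assuming each subset is given as a plain list of class labels:

```python
import numpy as np

def gini(labels):
    """gini(S) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in S."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(left_labels, right_labels):
    """gini_split(S) = |S1|/|S| * gini(S1) + |S2|/|S| * gini(S2)."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)

# Example: a split that isolates most of class "b" on the right.
print(gini_split(["a", "a", "b"], ["b", "b", "b"]))
```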
27. (figure only)
28. When and how the original distributions are reconstructed
- Global: reconstruct the distribution for each attribute once, then run decision-tree classification.
- ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately; then run decision-tree classification.
- Local: the same as ByClass, except that instead of doing the reconstruction only once, reconstruction is done at each node.
29. Example (ByClass and Local)
30. Comparing the three algorithms
Algorithm | Execution time | Accuracy
Global | Cheapest | Worst
ByClass | Middle | Middle
Local | Most expensive | Best
31. Presentation Organization
- Associate Data Mining with Privacy
- Privacy-Preserving Data Mining scheme using random perturbation
- Privacy-Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
- Are there any other classification methods available?
32. Privacy-Preserving Data Mining using Randomized Response Techniques
- Randomized Response
- Building the Decision Tree
- Key: Information Gain calculation
- Experimental results
33. Randomized Response
- A survey contains a sensitive attribute A.
- Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other ("I have A" vs. "I do not have A").
- Respondents use a randomizing device to decide which question to answer. The device is designed in such a way that the probability of choosing the first question is θ.
34.
- To estimate the percentage of people who have attribute A, we can use:
- P*(A = yes) = P(A = yes) * θ + P(A = no) * (1 - θ)
- P*(A = no) = P(A = no) * θ + P(A = yes) * (1 - θ)
- P*(A = yes): the proportion of "yes" responses obtained from the (disguised) survey data
- P(A = yes): the estimated true proportion of "yes" responses
- Our goal: P(A = yes) and P(A = no) (a sketch of solving these equations follows below)
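Because P(A = no) = 1 - P(A = yes), the two equations above can be solved directly for P(A = yes) whenever θ ≠ 0.5. A minimal Python sketch (the function name is mine):

```python
def estimate_true_proportion(observed_yes, theta):
    """Solve the randomized-response equations for P(A = yes).

    observed_yes : proportion of "yes" answers in the disguised survey data, P*(A = yes)
    theta        : probability that the randomizing device picks the first question
    """
    if abs(theta - 0.5) < 1e-9:
        raise ValueError("theta = 0.5 makes the system unsolvable (both questions equally likely)")
    p_yes = (observed_yes - (1 - theta)) / (2 * theta - 1)
    return min(max(p_yes, 0.0), 1.0)   # clamp to a valid probability

# Example: 60% answered "yes" and the device picks the first question with probability 0.7.
print(estimate_true_proportion(0.60, 0.7))   # -> 0.75
```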
35. Example
- Sensitive attribute: Married?
- Two questions:
- Question A: "Are you married?" (a married respondent truthfully answers Yes, an unmarried one No)
- Question B: "Are you unmarried?" (a married respondent truthfully answers No, an unmarried one Yes)
36. Decision Tree (Key: Info Gain)
- m: the number of classes (m classes assumed)
- Qj: the relative frequency of class j in S
- v: any possible value of attribute A
- Sv: the subset of S for which attribute A has value v
- |Sv|: the number of elements in Sv
- |S|: the number of elements in S
- (the entropy and gain formulas in this notation follow below)
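In this notation, the standard entropy and information-gain formulas (the equations themselves were not included in the slide text) are:

```latex
\mathrm{Entropy}(S) \;=\; -\sum_{j=1}^{m} Q_j \log_2 Q_j,
\qquad
\mathrm{Gain}(S, A) \;=\; \mathrm{Entropy}(S) \;-\; \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```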
37.
- P(E): the proportion of the records in the undisguised data set that satisfy E = true
- P*(E): the proportion of the records in the disguised data set that satisfy E = true
- Assume the class label is binary.
- Then Entropy(S) can be calculated; similarly, calculate Entropy(Sv). Finally, we get Gain(S, A). (One way to estimate P(E) from P*(E) is sketched below.)
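One way to estimate P(E) from P*(E), assuming the records are disguised with the same randomization parameter θ used on slide 33; this is my reconstruction, not necessarily the exact formula from the slide:

```latex
P^*(E) \;=\; P(E)\,\theta \;+\; P(\bar{E})\,(1-\theta)
\quad\Longrightarrow\quad
P(E) \;=\; \frac{P^*(E) - (1-\theta)}{2\theta - 1}
\qquad (\theta \neq 0.5)
```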
38. Experimental results
39. Comparing these two cases
Aspect | Perturbation | Randomized Response
Attribute type | Continuous | Categorical
Privacy-preserving method | Value distortion | Randomized response
Choosing the attribute to split | Gini index | Information gain
Inverse procedure | Reconstruct the distribution | Estimate P(E) from P*(E)
40. Future work
- Solve categorical problems with the first scheme
- Solve continuous problems with the second scheme
- Combine these two schemes to solve some problems
- Other suitable classification methods