Title: Privacy-Preserving Data Mining
1. Privacy-Preserving Data Mining
- Presenter: Li Cao
- October 15, 2003
2. Presentation organization
- Associate Data Mining with Privacy
- Privacy-Preserving Data Mining scheme using random perturbation
- Privacy-Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
3. Privacy protection history / Privacy concerns nowadays
- Citizens' attitude
- Scholars' attitude
4. Internet users' attitudes
5. Privacy value
- Filtering to weed out unwanted information
- Better search results with less effort
- Useful recommendations
- Market trends
- Example: from the analysis of a large number of purchase transaction records together with the customers' age and income, we can learn which kinds of customers prefer a particular style or brand.
6. Motivation (Introducing Data Mining)
- Data Mining's goal: discover knowledge, trends, and patterns from large amounts of data.
- Data Mining's primary task: develop accurate models about aggregated data without access to precise information in individual data records (not only discovering knowledge but also preserving privacy).
7. Presentation organization
- Associate Data Mining with Privacy
- Privacy-Preserving Data Mining scheme using random perturbation
- Privacy-Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
8. Privacy-Preserving Data Mining scheme using random perturbation
- Basic idea
- Reconstruction procedure
- Decision-Tree classification
- Three different algorithms
9. Attribute list of an example
10. Records of an example
11. Privacy-preserving methods
- Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
- Value Distortion: return x_i + r instead of x_i, where r is a random value drawn from (a sketch of both variants follows below)
- a) a Uniform distribution
- b) a Gaussian distribution
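A minimal sketch of the two value-distortion variants, assuming numeric attribute values and NumPy; the function names and noise parameters here are illustrative, not from the original slides:

```python
import numpy as np

def distort_uniform(values, alpha):
    """Return x_i + r with r drawn uniformly from [-alpha, +alpha]."""
    values = np.asarray(values, dtype=float)
    r = np.random.uniform(-alpha, alpha, size=values.shape)
    return values + r

def distort_gaussian(values, sigma):
    """Return x_i + r with r drawn from a Gaussian with mean 0 and std sigma."""
    values = np.asarray(values, dtype=float)
    r = np.random.normal(0.0, sigma, size=values.shape)
    return values + r

# Example: perturb a column of ages before it leaves the client.
ages = [23, 35, 47, 52, 61]
print(distort_uniform(ages, alpha=10))
print(distort_gaussian(ages, sigma=5))
```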
12. Basic idea
- Original data → Perturbed data (let users provide a modified value for sensitive attributes)
- Estimate the distribution of the original data from the perturbed data
- Build classifiers using these reconstructed distributions (decision tree)
13. Basic Steps
14. Reconstruction problem
- View the n original data values x1, x2, ..., xn of a one-dimensional distribution as realizations of n independent identically distributed (iid) random variables X1, X2, ..., Xn, each with the same distribution as the random variable X.
- To hide these data values, n independent random variables Y1, Y2, ..., Yn have been used, each with the same distribution as a different random variable Y.
15.
- Given x1 + y1, x2 + y2, ..., xn + yn (where yi is the realization of Yi) and the cumulative distribution function FY for Y, we would like to estimate the cumulative distribution function FX for X.
- In short: given a cumulative distribution FY and the realizations of n iid random samples X1 + Y1, X2 + Y2, ..., Xn + Yn, estimate FX.
16. Reconstruction process
- Let the value of Xi + Yi be wi (= xi + yi). Use Bayes' rule to estimate the posterior distribution function F'X1 (given that X1 + Y1 = w1) for X1, assuming we know the density functions fX and fY for X and Y respectively. (The formula is written out below.)
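The equation itself did not survive in the slide text; in the notation above, the Bayes-rule posterior for X1 takes the following form (my reconstruction, to be checked against the underlying paper):

```latex
F'_{X_1}(a) \;=\;
\frac{\displaystyle\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}
     {\displaystyle\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}
```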
17.
- To estimate the posterior distribution function F'X given x1 + y1, x2 + y2, ..., xn + yn, we average the distribution functions for each of the Xi.
18.
- The corresponding posterior density function f'X is obtained by differentiating F'X.
- Given a sufficiently large number of samples, f'X will be very close to the real density function fX. (Both steps are written out below.)
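Written out in the same notation, the averaging step and the differentiated density are (again my reconstruction):

```latex
F'_X(a) \;=\; \frac{1}{n}\sum_{i=1}^{n} F'_{X_i}(a),
\qquad
f'_X(a) \;=\; \frac{1}{n}\sum_{i=1}^{n}
      \frac{f_Y(w_i - a)\, f_X(a)}
           {\displaystyle\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
```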
19. Reconstruction algorithm (a runnable Python sketch follows below)
- f'X^0 := Uniform distribution
- j := 0  // iteration number
- repeat
-     f'X^(j+1)(a) := (1/n) * Σ_i [ fY(wi - a) * f'X^j(a) / ∫ fY(wi - z) * f'X^j(z) dz ]   // update step: the averaged posterior density, using the current estimate
-     j := j + 1
- until (stopping criterion met)
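A minimal NumPy sketch of this iterative reconstruction, assuming the noise density fY is known and the estimate is kept on a discrete grid; the function and variable names are mine, and the stopping rule is the "successive estimates are very close" criterion from the next slide:

```python
import numpy as np

def reconstruct_distribution(w, f_y, grid, n_iter=50, tol=1e-6):
    """Iteratively estimate the density of X from perturbed values w_i = x_i + y_i.

    w    : 1-D array of observed (perturbed) values
    f_y  : vectorized function giving the known noise density f_Y
    grid : 1-D array of points a at which the estimate f_X is evaluated
    Returns the estimated density of X over `grid`.
    """
    w = np.asarray(w, dtype=float)
    grid = np.asarray(grid, dtype=float)
    step = grid[1] - grid[0]

    # f_X^0 := uniform distribution over the grid
    f_x = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))

    # Pre-compute f_Y(w_i - a) for every observation i and grid point a.
    noise = f_y(w[:, None] - grid[None, :])            # shape (n, len(grid))

    for _ in range(n_iter):
        denom = (noise * f_x).sum(axis=1) * step        # approximates the integral over z
        denom = np.where(denom > 0, denom, 1e-12)       # guard against division by zero
        f_new = (noise * f_x / denom[:, None]).mean(axis=0)
        f_new /= f_new.sum() * step                     # renormalize to integrate to 1
        if np.abs(f_new - f_x).sum() * step < tol:      # successive estimates very close
            f_x = f_new
            break
        f_x = f_new
    return f_x

# Example: ages perturbed with Gaussian noise of standard deviation 10.
rng = np.random.default_rng(0)
true_x = rng.normal(40, 8, size=2000)
w = true_x + rng.normal(0, 10, size=true_x.shape)
gauss = lambda t: np.exp(-t**2 / (2 * 10**2)) / (10 * np.sqrt(2 * np.pi))
grid = np.linspace(0, 80, 161)
est = reconstruct_distribution(w, gauss, grid)
```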
20. Stopping criterion
- Observed randomized distribution ≈ the result of randomizing the current estimate of the original distribution
- The difference between successive estimates of the original distribution is very small.
21. Reconstruction effect
22. (figure only)
23. Decision-Tree Classification
- Two stages: (1) Growth, (2) Prune
- Example
24. Tree-growth phase algorithm (a Python sketch follows below)
- Partition(Data S)
- begin
-     if (most points in S are of the same class) then return;
-     for each attribute A do
-         evaluate splits on attribute A;
-     use the best split to partition S into S1 and S2;
-     Partition(S1);
-     Partition(S2);
- end
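A compact Python rendering of the growth phase above. `find_best_split` is a hypothetical helper (it could, for example, rank candidate splits by the Gini index from slide 26), and the "most points are of the same class" test is approximated with a simple majority threshold:

```python
from collections import Counter

def partition(records, labels, find_best_split, purity=0.95, min_size=5):
    """Tree-growth phase: recursively split S until most points share one class.

    find_best_split(records, labels) is a hypothetical helper that returns
    (attribute, threshold, left_idx, right_idx), or None if no useful split exists.
    Returns either a leaf label or a nested (attribute, threshold, left, right) tuple.
    """
    majority, majority_count = Counter(labels).most_common(1)[0]
    # Stop when most points in S are of the same class (or the node is too small).
    if majority_count >= purity * len(labels) or len(labels) <= min_size:
        return majority
    split = find_best_split(records, labels)           # evaluate splits on each attribute
    if split is None:
        return majority
    attribute, threshold, left_idx, right_idx = split
    if not left_idx or not right_idx:                  # degenerate split: stop here
        return majority
    left = partition([records[i] for i in left_idx],
                     [labels[i] for i in left_idx], find_best_split, purity, min_size)
    right = partition([records[i] for i in right_idx],
                      [labels[i] for i in right_idx], find_best_split, purity, min_size)
    return (attribute, threshold, left, right)
```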
25. Choose the best split
- Information gain (categorical attributes)
- Gini index (continuous attributes)
26. Gini index calculation (a small Python version follows below)
- gini(S) = 1 - Σ_j p_j²   (p_j is the relative frequency of class j in S)
- If a split divides S into two subsets S1 and S2:
- gini_split(S) = (|S1| / |S|) * gini(S1) + (|S2| / |S|) * gini(S2)
- Note: calculating this index requires only the distribution of the class values.
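A small Python version of the two formulas above, assuming each subset is given as a plain list of class labels:

```python
import numpy as np

def gini(labels):
    """gini(S) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in S."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(left_labels, right_labels):
    """gini_split(S) = |S1|/|S| * gini(S1) + |S2|/|S| * gini(S2)."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)

# Example: a split that isolates most of class "b" on the right.
print(gini_split(["a", "a", "b"], ["b", "b", "b"]))
```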
27. (figure only)
28. When and how the original distributions are reconstructed
- Global: reconstruct the distribution for each attribute once, then run decision-tree classification.
- ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately; then run decision-tree classification.
- Local: the same as ByClass, except that instead of doing the reconstruction only once, reconstruction is done at each node.
29. Example (ByClass and Local)
30. Comparing the three algorithms
Algorithm | Execution time | Accuracy
Global | Cheapest | Worst
ByClass | Middle | Middle
Local | Most expensive | Best
31. Presentation Organization
- Associate Data Mining with Privacy
- Privacy-Preserving Data Mining scheme using random perturbation
- Privacy-Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
- Are there any other classification methods available?
32. Privacy-Preserving Data Mining using Randomized Response Techniques
- Randomized Response
- Building the Decision Tree
- Key: Information Gain calculation
- Experimental results
33. Randomized Response
- A survey contains a sensitive attribute A.
- Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other ("I have A" vs. "I do not have A").
- Respondents use a randomizing device to decide which question to answer. The device is designed in such a way that the probability of choosing the first question is θ.
34.
- To estimate the percentage of people who have attribute A, we can use:
- P*(A = yes) = P(A = yes) * θ + P(A = no) * (1 - θ)
- P*(A = no) = P(A = no) * θ + P(A = yes) * (1 - θ)
- P*(A = yes): the proportion of "yes" responses obtained from the (disguised) survey data
- P(A = yes): the estimated true proportion of "yes" responses
- Our goal: P(A = yes) and P(A = no) (a sketch of solving these equations follows below)
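Because P(A = no) = 1 - P(A = yes), the two equations above can be solved directly for P(A = yes) whenever θ ≠ 0.5. A minimal Python sketch (the function name is mine):

```python
def estimate_true_proportion(observed_yes, theta):
    """Solve the randomized-response equations for P(A = yes).

    observed_yes : proportion of "yes" answers in the disguised survey data, P*(A = yes)
    theta        : probability that the randomizing device picks the first question
    """
    if abs(theta - 0.5) < 1e-9:
        raise ValueError("theta = 0.5 makes the system unsolvable (both questions equally likely)")
    p_yes = (observed_yes - (1 - theta)) / (2 * theta - 1)
    return min(max(p_yes, 0.0), 1.0)   # clamp to a valid probability

# Example: 60% answered "yes" and the device picks the first question with probability 0.7.
print(estimate_true_proportion(0.60, 0.7))   # -> 0.75
```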
35. Example
- Sensitive attribute: Married?
- Two questions:
- Question A: "Are you married?" (a married respondent truthfully answers Yes, an unmarried one No)
- Question B: "Are you unmarried?" (a married respondent truthfully answers No, an unmarried one Yes)
36. Decision Tree (Key: Info Gain)
- m: the number of classes (m classes assumed)
- Qj: the relative frequency of class j in S
- v: any possible value of attribute A
- Sv: the subset of S for which attribute A has value v
- |Sv|: the number of elements in Sv
- |S|: the number of elements in S
- (the entropy and gain formulas in this notation follow below)
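In this notation, the standard entropy and information-gain formulas (the equations themselves were not included in the slide text) are:

```latex
\mathrm{Entropy}(S) \;=\; -\sum_{j=1}^{m} Q_j \log_2 Q_j,
\qquad
\mathrm{Gain}(S, A) \;=\; \mathrm{Entropy}(S) \;-\; \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```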
37.
- P(E): the proportion of the records in the undisguised data set that satisfy E = true
- P*(E): the proportion of the records in the disguised data set that satisfy E = true
- Assume the class label is binary.
- Then Entropy(S) can be calculated; similarly, calculate Entropy(Sv). Finally, we get Gain(S, A). (One way to estimate P(E) from P*(E) is sketched below.)
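One way to estimate P(E) from P*(E), assuming the records are disguised with the same randomization parameter θ used on slide 33; this is my reconstruction, not necessarily the exact formula from the slide:

```latex
P^*(E) \;=\; P(E)\,\theta \;+\; P(\bar{E})\,(1-\theta)
\quad\Longrightarrow\quad
P(E) \;=\; \frac{P^*(E) - (1-\theta)}{2\theta - 1}
\qquad (\theta \neq 0.5)
```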
38. Experimental results
39. Comparing these two cases
Aspect | Perturbation | Randomized Response
Attribute type | Continuous | Categorical
Privacy-preserving method | Value distortion | Randomized response
Choosing the attribute to split | Gini index | Information gain
Inverse procedure | Reconstruct the distribution | Estimate P(E) from P*(E)
40. Future work
- Solve categorical problems with the first scheme
- Solve continuous problems with the second scheme
- Combine these two schemes to solve some problems
- Other suitable classification methods