Title: Other Perturbation Techniques
1 Other Perturbation Techniques
2 Outline
- Randomized Responses
- Sketch
- Project ideas
3 Randomized Responses
- Problem description
- A provides the answer to B's question
- A wants to preserve his/her privacy
- Question/answer can be sensitive
- The method
- Assume the answer can be yes or no
- A answers honestly with probability $\theta$, and with probability $1-\theta$ gives the randomized (opposite) answer
- We can estimate the real probabilities of yes and no from the randomized responses
4
- Notations
- O(yes): observed probability of yes in the randomized responses, i.e. (# of yes) / (total # of responses)
- P(yes): real probability of yes
- Inference
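- $O(\text{yes}) = P(\text{yes})\,\theta + P(\text{no})\,(1-\theta)$
- $\quad = P(\text{yes})\,\theta + (1 - P(\text{yes}))\,(1-\theta)$
- $\Rightarrow P(\text{yes}) = \dfrac{O(\text{yes}) + \theta - 1}{2\theta - 1}$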
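The inference step can be checked with a small simulation. The following is a minimal sketch, assuming each respondent reports the true answer with probability theta and the opposite answer otherwise; the function names and the parameter values (theta = 0.8, true P(yes) = 0.3) are illustrative and not from the slides. Note that the estimate is undefined when theta = 0.5, because the responses then carry no information.

```python
import random

def randomize(answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, otherwise the opposite."""
    return answer if rng.random() < theta else not answer

def estimate_p_yes(o_yes: float, theta: float) -> float:
    """Invert O(yes) = P(yes)*theta + (1 - P(yes))*(1 - theta)."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the responses uninformative")
    return (o_yes + theta - 1) / (2 * theta - 1)

rng = random.Random(0)
theta, true_p, n = 0.8, 0.3, 100_000            # illustrative values
truth = [rng.random() < true_p for _ in range(n)]
reported = [randomize(a, theta, rng) for a in truth]
o_yes = sum(reported) / n                       # (# of yes) / (total # of responses)
print(f"O(yes) = {o_yes:.3f}, estimated P(yes) = {estimate_p_yes(o_yes, theta):.3f}")
```

With enough responses the estimate should land close to the true P(yes), even though no individual response can be trusted.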
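5
- Extend to multiple categories
- The answer $c_i$ is changed to $c_j$ with probability $\theta_{ij}$
- $O = (O(c_1), O(c_2), \dots, O(c_n))$: observed probabilities of the $c_i$
- $P = (P(c_1), P(c_2), \dots, P(c_n))$: real probabilities of the $c_i$
- The relationship between O and P
- $O(c_j) = \sum_{i=1}^{n} \theta_{ij}\, P(c_i)$, i.e. $O = \Theta^{T} P$ with $\Theta = [\theta_{ij}]$
Note: When $\Theta$ is invertible, use matrix inversion to solve for P. Otherwise, use iterative methods similar to that in Rakesh's paper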
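The matrix-inversion case can be illustrated as follows. This is a minimal sketch, assuming theta[i][j] is the probability that category c_i is reported as c_j (so every row sums to 1); the 3x3 matrix and the probabilities are made-up example values, and the small Gauss-Jordan solver stands in for any linear-algebra routine.

```python
def observed_from_real(p, theta):
    """O(c_j) = sum_i theta[i][j] * P(c_i)."""
    n = len(p)
    return [sum(theta[i][j] * p[i] for i in range(n)) for j in range(n)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; an iterative method is needed")
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(n):
            if k != col:
                f = m[k][col] / m[col][col]
                m[k] = [u - f * v for u, v in zip(m[k], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def real_from_observed(o, theta):
    """Recover P from O = Theta^T P."""
    n = len(o)
    theta_t = [[theta[i][j] for i in range(n)] for j in range(n)]
    return solve(theta_t, o)

theta = [[0.7, 0.2, 0.1],      # illustrative RR matrix, rows sum to 1
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
p = [0.5, 0.3, 0.2]            # illustrative real probabilities
o = observed_from_real(p, theta)
p_hat = real_from_observed(o, theta)
print("O =", o)
print("recovered P =", p_hat)
```

In practice O is measured from a finite set of responses, so the recovered P is only an estimate of the real probabilities.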
6
- Different perturbation matrices can be used. Which one is the best?
- Balance between privacy and utility?
- No perturbation (identity matrix): zero privacy is preserved, while full data utility is preserved
- Uniform randomization: privacy is fully preserved, while no data utility is left
7 Optimizing both privacy and utility
- Read paper [33]
- Privacy: similar to previous discussion
- Based on accuracy of estimation
- A Bayes method
- $C = (c_1, c_2, \dots, c_n)$
- Y is the perturbed value, X is the original value, and $\hat{X}$ is the estimated value
- Accuracy of estimation: it can be calculated by checking the original data, the perturbed data and the estimated data
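One way to make the accuracy of estimation concrete is sketched below. It assumes the Bayes method is a maximum a posteriori (MAP) guess, i.e. the estimate for a perturbed value Y = c_y is the category c_i that maximizes P(c_i) * theta[i][y]; the slide does not spell out the exact estimator, so this is an assumption, as are the function names and the example prior.

```python
def map_estimate(y, prior, theta):
    """Index i maximizing P(X = c_i | Y = c_y), proportional to prior[i] * theta[i][y]."""
    return max(range(len(prior)), key=lambda i: prior[i] * theta[i][y])

def estimation_accuracy(prior, theta):
    """P(X_hat = X): sum over y of the joint probability of (X_hat(y), y)."""
    total = 0.0
    for y in range(len(prior)):
        x_hat = map_estimate(y, prior, theta)
        total += prior[x_hat] * theta[x_hat][y]
    return total

prior = [0.5, 0.3, 0.2]                                   # illustrative P(c_i)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3] * 3 for _ in range(3)]
print("no perturbation:", estimation_accuracy(prior, identity))
print("uniform randomization:", estimation_accuracy(prior, uniform))
```

Without perturbation the guess is always right, while under uniform randomization the best guess is simply the most frequent category, so the accuracy drops to the largest prior probability.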
8
- Privacy
- Average: 1 - (accuracy of estimation)
- Worst case
- Utility
- $P(c_i)$ is the original prob, $O(c_i)$ is the prob on the perturbed data, and $\hat{P}(c_i)$ is the estimated prob
- Utility depends on the difference between the original prob and the estimated prob
9 Optimization algorithm
- Find the perturbation that balances the two metrics
- The evolutionary algorithm
- Start with a set of initial RR matrices
- Repeat the following steps in each iteration
- Mating: selecting two RR matrices in the pool
- Crossover: exchanging several columns between the two RR matrices
- Mutation: change some values in an RR matrix
- Meet the privacy bound: filtering the resultant matrices
- Evaluate the fitness value for the new RR matrices
- Note: the fitness value is defined in terms of the privacy and utility metrics
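The loop above can be sketched as follows. This is an illustration under several assumptions that are not from the slides: privacy is taken as 1 minus the accuracy of a MAP guess, utility as the negative expected squared error of the matrix-inversion estimate over n_records responses, and fitness as a hypothetical weighted sum of the two. Rows are renormalized after crossover and mutation so that each matrix remains a valid RR matrix. The paper's actual metrics and selection scheme may differ.

```python
import random

def normalize_rows(m):
    """Rescale every row to sum to 1 so m stays a valid RR matrix."""
    return [[v / sum(row) for v in row] for row in m]

def random_matrix(n, rng):
    return normalize_rows([[rng.random() + 1e-6 for _ in range(n)] for _ in range(n)])

def privacy(prior, theta):
    """Average privacy: 1 - accuracy of the best (MAP) guess of X given Y."""
    n = len(prior)
    return 1.0 - sum(max(prior[i] * theta[i][y] for i in range(n)) for y in range(n))

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination; raise ValueError if singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(n):
            if k != col:
                f = m[k][col] / m[col][col]
                m[k] = [u - f * v for u, v in zip(m[k], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def utility(prior, theta, n_records=1000):
    """Negative expected squared error of the matrix-inversion estimate of P."""
    n = len(prior)
    theta_t = [[theta[i][j] for i in range(n)] for j in range(n)]
    o = [sum(theta[i][j] * prior[i] for i in range(n)) for j in range(n)]
    try:  # inv_cols[j] is column j of the inverse of Theta^T
        inv_cols = [solve(theta_t, [float(k == j) for k in range(n)]) for j in range(n)]
    except ValueError:
        return float("-inf")  # P cannot be recovered by inversion
    mse = sum(sum(inv_cols[j][k] ** 2 * o[j] for j in range(n)) - prior[k] ** 2
              for k in range(n)) / n_records
    return -mse

def fitness(prior, theta, weight=100.0):
    """Hypothetical scalar fitness: privacy plus weighted utility."""
    return privacy(prior, theta) + weight * utility(prior, theta)

def crossover(a, b, rng):
    """Exchange a random subset of columns between two RR matrices."""
    n = len(a)
    cols = rng.sample(range(n), max(1, n // 2))
    child_a, child_b = [row[:] for row in a], [row[:] for row in b]
    for i in range(n):
        for j in cols:
            child_a[i][j], child_b[i][j] = b[i][j], a[i][j]
    return [normalize_rows(child_a), normalize_rows(child_b)]

def mutate(m, rng, scale=0.1):
    """Perturb one randomly chosen entry, keeping it positive."""
    out = [row[:] for row in m]
    i, j = rng.randrange(len(m)), rng.randrange(len(m))
    out[i][j] = max(1e-6, out[i][j] + rng.uniform(-scale, scale))
    return normalize_rows(out)

def evolve(prior, pool_size=8, generations=30, privacy_bound=0.3, seed=0):
    rng = random.Random(seed)
    pool = []
    while len(pool) < pool_size:                      # initial RR matrices
        m = random_matrix(len(prior), rng)
        if privacy(prior, m) >= privacy_bound:
            pool.append(m)
    score = lambda m: fitness(prior, m)
    history = [max(map(score, pool))]
    for _ in range(generations):
        a, b = rng.sample(pool, 2)                    # mating
        children = [mutate(c, rng) for c in crossover(a, b, rng)]
        children = [c for c in children if privacy(prior, c) >= privacy_bound]
        pool = sorted(pool + children, key=score, reverse=True)[:pool_size]
        history.append(score(pool[0]))
    return pool[0], history

best, history = evolve([0.5, 0.3, 0.2])
print(f"best fitness: {history[0]:.4f} -> {history[-1]:.4f}")
```

Because the pool always keeps its fittest members, the best fitness never decreases from one iteration to the next; the weight, pool size, and privacy bound are tuning knobs chosen here only for illustration.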
11 Summary
- Randomized response is the basic technique for perturbing categorical data
- Boolean
- Multi-category
12 Sketch
- Address the problem of high-dimensional sparse data
- Multiplicative perturbation
- Randomized responses
- Market basket data
- Bag of words
13 Definition of sketch
- Similar to projection perturbation
- Map d-dimensional data → r-dimensional data, $r \ll d$
- Difference: for each record the mapping matrix is different
- Definition
- $X = (x_1, \dots, x_d)$, $S = (s_1, \dots, s_r)$
- $s_j = \sum_{i=1}^{d} x_i\, r_{ij}$, where each $r_{ij}$ is randomly drawn from $\{-1, +1\}$
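A sketch of a single sparse record can be computed as below. This is a minimal illustration, assuming each sketch component is the sum of the record's entries multiplied by independent random signs; the helper names, the dimensions (d = 1000, r = 8), and the example record are made up.

```python
import random

def make_signs(d, r, rng):
    """r rows of d random signs; entry [j][i] plays the role of r_ij."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * r_ij for each of the r sign rows."""
    return [sum(xi * rij for xi, rij in zip(x, row)) for row in signs]

rng = random.Random(0)
d, r = 1000, 8
x = [0] * d
for i in rng.sample(range(d), 12):   # sparse boolean record, e.g. a market basket
    x[i] = 1
signs = make_signs(d, r, rng)
s = sketch(x, signs)
print(s)
```

Only the 12 non-zero items contribute to each component, and their random signs partly cancel, which is why the 8 sketch values stay small compared with the 1000 original dimensions.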
14 Property
- Dot product of the original data X and Y can be approximated with their sketches
- Dot product is important in calculating Euclidean distances!
15
- Accuracy of the dot product estimation
Large r → smaller variance → better quality;
however, → lower privacy
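The effect of r on the dot product estimate can be seen in a short experiment. This is a minimal sketch that estimates X · Y as the average of s_j * t_j over the sketch components; it assumes the two records are sketched with the same random signs, since otherwise the products carry no information about X · Y. How this interacts with per-record mapping matrices is not specified on the slides, and the vectors and values of r are illustrative.

```python
import random

def make_signs(d, r, rng):
    """r rows of d random signs; entry [j][i] plays the role of r_ij."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * r_ij for each of the r sign rows."""
    return [sum(xi * rij for xi, rij in zip(x, row)) for row in signs]

def dot_estimate(s, t):
    """Estimate X . Y as the average of the component-wise products s_j * t_j."""
    return sum(a * b for a, b in zip(s, t)) / len(s)

rng = random.Random(1)
d = 100
x = [1 if i < 10 else 0 for i in range(d)]        # items 0..9
y = [1 if 5 <= i < 15 else 0 for i in range(d)]   # items 5..14, so X . Y = 5
true_dot = sum(a * b for a, b in zip(x, y))
for r in (10, 100, 2000):
    signs = make_signs(d, r, rng)                 # same signs for both records
    est = dot_estimate(sketch(x, signs), sketch(y, signs))
    print(f"r = {r:5d}: estimate = {est:.2f} (true value {true_dot})")
```

Estimates from small r tend to scatter widely around the true value, while larger r tightens them, at the cost of releasing more information about each record.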
16 Privacy
- Original data value can be estimated
- Sparse data
- Most terms are canceled in the sketch
- Estimate of $x_k$
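The exact estimator shown on the slide was not preserved, so the following is only one plausible illustration of how an original value could be estimated. It assumes the party doing the estimation knows the random signs and averages s_j * r_kj over the r components, which has expected value x_k when the signs are independent. All names and parameter values are hypothetical.

```python
import random

def make_signs(d, r, rng):
    """r rows of d random signs; entry [j][i] plays the role of r_ij."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * r_ij for each of the r sign rows."""
    return [sum(xi * rij for xi, rij in zip(x, row)) for row in signs]

def estimate_item(s, signs, k):
    """Estimate x_k as the average of s_j * r_kj over the sketch components."""
    return sum(sj * row[k] for sj, row in zip(s, signs)) / len(s)

rng = random.Random(2)
d, r = 200, 2000
x = [1 if i < 10 else 0 for i in range(d)]   # sparse record with 10 items
signs = make_signs(d, r, rng)
s = sketch(x, signs)
est_present = estimate_item(s, signs, 0)     # item 0 is in the record
est_absent = estimate_item(s, signs, 50)     # item 50 is not
print(f"x_0 estimate: {est_present:.2f}, x_50 estimate: {est_absent:.2f}")
```

The noise in each estimate comes from the other non-zero items of the record, so sparse records with many sketch components are the easiest to reconstruct.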
17 Privacy
Suppress the record if this condition is not satisfied
Another concept: K-variance; see paper [29] for more details.
18
- Applications
- Dot product estimation
- Determine the length of a sparse transaction (# of non-zero items in the boolean vector)
- Determine Euclidean distance
- Average of a set of records (centroid of a cluster)
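These applications all follow from the dot product estimate and the linearity of the sketch. The code below is a minimal illustration, assuming all records share the same random signs; the helper names and example records are made up. The squared length of a boolean vector equals its number of non-zero items, the squared Euclidean distance is the squared length of the difference, and the average of several sketches is the sketch of the centroid.

```python
import random

def make_signs(d, r, rng):
    """r rows of d random signs; entry [j][i] plays the role of r_ij."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * r_ij for each of the r sign rows."""
    return [sum(xi * rij for xi, rij in zip(x, row)) for row in signs]

def sq_norm_estimate(s):
    """Estimate X . X; for a boolean vector this is the # of non-zero items."""
    return sum(v * v for v in s) / len(s)

def sq_distance_estimate(s, t):
    """Estimate ||X - Y||^2 from the sketch of X - Y, which is s - t."""
    return sq_norm_estimate([a - b for a, b in zip(s, t)])

def centroid_sketch(sketches):
    """Component-wise average; by linearity, the sketch of the centroid."""
    return [sum(col) / len(sketches) for col in zip(*sketches)]

rng = random.Random(3)
d, r = 100, 2000
x = [1 if i < 10 else 0 for i in range(d)]        # 10 items
y = [1 if 5 <= i < 15 else 0 for i in range(d)]   # 10 items, 5 shared with x
signs = make_signs(d, r, rng)
sx, sy = sketch(x, signs), sketch(y, signs)
length_est = sq_norm_estimate(sx)                 # true transaction length: 10
dist_est = sq_distance_estimate(sx, sy)           # true squared distance: 10
center = centroid_sketch([sx, sy])
print(f"length ~ {length_est:.2f}, squared distance ~ {dist_est:.2f}")
```

The centroid relation is exact rather than approximate, which makes sketches convenient for clustering steps that repeatedly average records.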