1 Publishing Set-Valued Data via Differential Privacy
- Rui Chen, Concordia University
- Noman Mohammed, Concordia University
- Benjamin C. M. Fung, Concordia University
- Bipin C. Desai, Concordia University
- Li Xiong, Emory University
VLDB 2011
2 Outline
- Introduction
- Preliminaries
- Sanitization algorithm
- Experimental results
- Conclusions
3 Introduction
- The problem: non-interactive set-valued data publication under differential privacy
- Typical set-valued data: transaction data, web search queries
4 Introduction
- Set-valued data refers to data in which each record owner is associated with a set of items drawn from an item universe.

TID  Items
t1   I1, I2, I3, I4
t2   I2, I4
t3   I2
t4   I1, I2
t5   I2
t6   I1
t7   I1, I2, I3, I4
t8   I2, I3, I4
5 Introduction
- Existing works [1-7] on publishing set-valued data are based on partition-based privacy models [8].
- They provide insufficient privacy protection:
  - Composition attack [8]
  - deFinetti attack [9]
  - Foreground knowledge attack [10]
- They are vulnerable to background knowledge.
6 Introduction
- Differential privacy is independent of an adversary's background knowledge and computational power (with exceptions [11]).
- The outcome of any analysis should not overly depend on a single data record.
- Existing differentially private data publishing approaches are not adequate in terms of both utility and scalability for our problem.
7 Introduction
- Problems of data-independent publishing approaches
- With universe I = {I1, I2, I3}, the output domain is the powerset of I: {I1}, {I2}, {I3}, {I1, I2}, {I1, I3}, {I2, I3}, {I1, I2, I3}
- Scalability: O(2^|I|) candidate itemsets
- Utility: noise accumulates exponentially
9 Preliminaries
- Context-free taxonomy tree
- Each internal node is the set of its leaves, not necessarily a semantic generalization
10 Preliminaries
- Two databases D and D' are neighbours if they differ on at most one record.
- A non-interactive privacy mechanism A gives ε-differential privacy if for all neighbours D, D', and for any possible sanitized database D* ∈ Range(A):
  Pr[A(D) = D*] ≤ exp(ε) × Pr[A(D') = D*]
11 Preliminaries
- Global sensitivity [12]: GS(Q) = max ||Q(D) - Q(D')||_1 over all neighbours D, D'.
- For example, for a single counting query Q over a dataset D, returning Q(D) + Lap(1/ε) gives ε-differential privacy (a minimal sketch follows below).
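A minimal sketch of this Laplace mechanism in Python (numpy assumed; the function name is illustrative, not from the paper):

    import numpy as np

    def noisy_count(true_count, epsilon, sensitivity=1.0):
        """Laplace mechanism: a counting query has global sensitivity 1,
        so Laplace(1/epsilon) noise gives epsilon-differential privacy."""
        return true_count + np.random.laplace(0.0, sensitivity / epsilon)

    answer = noisy_count(6, epsilon=0.5)  # noisy answer to one counting query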
12 Preliminaries
- Exponential mechanism [13]: given a utility function q : (D × R) → ℝ for a database instance D, the mechanism A that returns r ∈ R with probability proportional to exp(εq(D, r) / (2Δq)) gives ε-differential privacy, where Δq is the sensitivity of q (a sketch follows below).
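A minimal sketch of the exponential mechanism in Python; the candidate set, quality function, and names are placeholders:

    import numpy as np

    def exponential_mechanism(candidates, quality, epsilon, sensitivity):
        """Return one candidate r, drawn with probability proportional to
        exp(epsilon * q(D, r) / (2 * sensitivity))."""
        scores = np.array([quality(r) for r in candidates], dtype=float)
        # Shift by the max for numerical stability; the ratios are unchanged.
        weights = np.exp(epsilon * (scores - scores.max()) / (2 * sensitivity))
        probs = weights / weights.sum()
        return candidates[np.random.choice(len(candidates), p=probs)]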
13 Preliminaries
- Composition properties [14] (a toy illustration follows below):
  - Sequential composition: Σᵢ εᵢ-differential privacy
  - Parallel composition: max(εᵢ)-differential privacy
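As a toy illustration of the bookkeeping (the budget values are hypothetical, not from the slides):

    # Sequential composition: mechanisms applied to the SAME data
    # consume the sum of their budgets.
    eps_sequential = sum([0.1, 0.2, 0.2])  # 0.5-differential privacy

    # Parallel composition: mechanisms applied to DISJOINT subsets
    # of the data consume only the maximum budget.
    eps_parallel = max([0.1, 0.2, 0.2])    # 0.2-differential privacy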
14Preliminaries
For a given itemset I I , a counting query Q
over a dataset D is defined to be
A privacy mechanism A is (a, d)-useful if with
probability 1- d, for every counting query and
every dataset D, for DA(D), Q(D)-Q(D)lt a.
15
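In Python (illustrative; records are modeled as sets of items, and the dataset is the example from slide 4):

    def counting_query(itemset, dataset):
        """Q_I(D): the number of records in D containing every item of I."""
        itemset = set(itemset)
        return sum(1 for record in dataset if itemset <= set(record))

    D = [{"I1", "I2", "I3", "I4"}, {"I2", "I4"}, {"I2"}, {"I1", "I2"},
         {"I2"}, {"I1"}, {"I1", "I2", "I3", "I4"}, {"I2", "I3", "I4"}]
    assert counting_query({"I2", "I4"}, D) == 4  # t1, t2, t7, t8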
16 Sanitization Algorithm

TID  Items
t1   I1, I2, I3, I4
t2   I2, I4
t3   I2
t4   I1, I2
t5   I2
t6   I1
t7   I1, I2, I3, I4
t8   I2, I3, I4
- Generalize all records to a single partition
- Keep partitioning non-empty partitions until leaf partitions are reached (a condensed sketch follows below)
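A condensed Python sketch of this top-down recursion, with loudly flagged simplifications: it uses a uniform per-step budget eps_step rather than the paper's adaptive allocation (slides 17-19), and it only recurses into sub-partitions that are actually non-empty, whereas the real algorithm must also probe empty sub-partitions to satisfy differential privacy (handled efficiently on slide 20). All names are illustrative.

    import numpy as np

    class Node:
        """Context-free taxonomy node; `leaves` is the set of items it covers."""
        def __init__(self, leaves, children=()):
            self.leaves = frozenset(leaves)
            self.children = list(children)

    def expand(record, node):
        """The children of `node` whose leaf items the record intersects."""
        return frozenset(c for c in node.children if record & c.leaves)

    def partition(records, cut, eps_step, eps_count, threshold, out):
        """Split a partition until its hierarchy cut contains only leaf
        nodes, then release its noisy size if it survives the threshold."""
        internal = [n for n in cut if n.children]
        if not internal:  # leaf partition reached: add Laplace noise to size
            noisy = len(records) + np.random.laplace(0, 1.0 / eps_count)
            if noisy >= threshold:
                out.append(([sorted(n.leaves) for n in cut], round(noisy)))
            return
        node = internal[0]
        rest = [n for n in cut if n is not node]
        groups = {}
        for r in records:  # sub-partition = records sharing a child pattern
            groups.setdefault(expand(r, node), []).append(r)
        for child_set, recs in groups.items():
            # A noisy size decides whether the sub-partition is expanded.
            if len(recs) + np.random.laplace(0, 1.0 / eps_step) >= threshold:
                partition(recs, rest + list(child_set),
                          eps_step, eps_count, threshold, out)

    # Example: the 4-item universe with a binary taxonomy, as on the slides.
    leaf = {i: Node([i]) for i in ("I1", "I2", "I3", "I4")}
    root = Node(leaf.keys(), [Node(["I1", "I2"], [leaf["I1"], leaf["I2"]]),
                              Node(["I3", "I4"], [leaf["I3"], leaf["I4"]])])
    out = []
    D = [frozenset(t) for t in (["I1", "I2", "I3", "I4"], ["I2", "I4"],
                                ["I2"], ["I1", "I2"], ["I2"], ["I1"],
                                ["I1", "I2", "I3", "I4"], ["I2", "I3", "I4"])]
    partition(D, [root], eps_step=0.125, eps_count=0.5, threshold=2.0, out=out)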
17 Sanitization Algorithm
- Privacy budget allocation
- We reserve B/2 to obtain noisy sizes of leaf partitions and use the rest, B/2, to guide the partitioning.
- Assign less budget to more general partitions and more budget to more specific partitions.
18 Sanitization Algorithm
- Privacy budget allocation
- A hierarchy cut needs at most a bounded number of partition operations to reach leaf partitions (determined by the heights of the taxonomy subtrees in the cut).
- Example: the cut {I{1,2}, I{3,4}} needs at most two partition operations to reach leaf partitions.
19 Sanitization Algorithm
- Privacy budget allocation
- We reserve B/2 to obtain noisy sizes of leaf partitions and use the rest, B/2, to guide the partitioning.
- Assign less budget to more general partitions and more budget to more specific partitions.
- Example (a small helper follows below):
  - First partition operation: (B/2)/3 = B/6
  - Second partition operation: (B/2 - B/6)/2 = B/6
  - A leaf partition's noisy count: unspent partitioning budget plus the reserved half, e.g. B/6 + B/2 = 2B/3
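The arithmetic above, expressed as a small helper (illustrative; max_ops stands for the bound on remaining partition operations for the current cut, per slide 18):

    def step_budget(remaining, max_ops):
        """Equal share of the remaining partitioning budget, assuming at
        most `max_ops` further partition operations are needed."""
        return remaining / max_ops

    B = 1.0
    remaining = B / 2                      # half of B reserved for counting
    e1 = step_budget(remaining, 3)         # (B/2)/3       = B/6
    remaining -= e1
    e2 = step_budget(remaining, 2)         # (B/2 - B/6)/2 = B/6
    remaining -= e2
    leaf_budget = remaining + B / 2        # B/6 + B/2 = 2B/3 for noisy counts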
20 Sanitization Algorithm
- For a non-leaf partition, we need to consider all possible sub-partitions (including empty ones) to satisfy differential privacy.
- Efficient implementation: handle empty and non-empty partitions separately (inspired by [16]).
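One way to make empty partitions cheap, in the spirit of [16] (a hedged sketch, not the paper's exact procedure): for m empty partitions, a noisy count Lap(1/ε) exceeds a non-negative threshold θ with probability p = (1/2)e^(-εθ), so the number that pass can be drawn from a binomial instead of materializing all m.

    import numpy as np

    def surviving_empty_counts(m, eps, theta, rng=np.random.default_rng()):
        """Equivalent to adding Lap(1/eps) noise to m zero counts and
        keeping values above theta (>= 0), without touching all m."""
        p = 0.5 * np.exp(-eps * theta)  # Pr[Lap(1/eps) > theta]
        k = rng.binomial(m, p)          # how many empty partitions pass
        u = 1.0 - rng.random(k)         # U ~ Uniform(0, 1]
        # Conditioned on exceeding theta, the Laplace tail is exponential:
        # X = theta + Exp(1/eps) = theta - ln(U)/eps.
        return theta - np.log(u) / eps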
22 Experiments
- Two real-life set-valued datasets are used:
  - MSNBC, publicly available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html)
  - STM, provided by the Société de transport de Montréal (STM) (http://www.stm.info)
23 Experiments
- Average relative error vs. privacy budget
[Figure: average relative error under B = 0.5, 0.75, 1.0]
24 Experiments
- Utility for frequent itemset mining
[Figure: frequent itemset mining utility under B = 0.5, 0.75, 1.0]
25 Experiments
[Figures: runtime vs. |D|; runtime vs. |I|]
27 Conclusions
- Differential privacy can be successfully applied to non-interactive set-valued data publishing with guaranteed utility.
- Differential privacy can be achieved by data-dependent solutions with improved efficiency and accuracy.
- The general idea of data-dependent solutions applies to other types of data, for example, relational data [17] and trajectory data [18].
28 References
- [1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ-uncertainty: Inference-proof transaction anonymization. In VLDB, pp. 1033-1044, 2010.
- [2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE, pp. 715-724, 2008.
- [3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, pp. 934-945, 2009.
- [4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. In VLDB, pp. 115-125, 2008.
- [5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for anonymizing set-valued data. VLDBJ, 20(1):83-106, 2011.
- [6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM, pp. 1109-1114, 2008.
- [7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In SIGKDD, pp. 767-775, 2008.
29 References
- [8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In SIGKDD, pp. 265-273, 2008.
- [9] D. Kifer. Attacks on privacy and deFinetti's theorem. In SIGMOD, pp. 127-138, 2009.
- [10] R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized data be used for privacy breaches? ACM Transactions on Knowledge Discovery from Data, to appear.
- [11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011.
- [12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265-284, 2006.
- [13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp. 94-103, 2007.
- [14] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19-30, 2009.
- [15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pp. 609-618, 2008.
30 References
- [16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially private publication of sparse data. In CoRR, 2011.
- [17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In SIGKDD, 2011.
- [18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data publication. ICDE, under review, 2012.
33 Lower Bound Results
- In the interactive setting, only a limited number of queries can be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database.
- In the non-interactive setting, one can only guarantee the utility of restricted classes of queries.
36 Threshold Selection
- We design the threshold as a function of the standard deviation of the noise and the height of a partition's hierarchy cut.
37 Relative error
- (α, δ)-usefulness is effective for giving an overall estimation of utility, but fails to produce intuitive experimental results.
- We experimentally measure the utility of sanitized data for counting queries by relative error:
  error(Q) = |Q(D*) - Q(D)| / max{Q(D), s}
  where the sanity bound s mitigates the effect of queries with very small true counts.
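In code (choosing the sanity bound, e.g. as a small fraction of |D|, is an assumption typical of this line of work, not taken from the slides):

    def relative_error(noisy_count, true_count, sanity_bound):
        """Relative error of a counting query; the sanity bound prevents
        queries with tiny true counts from dominating the metric."""
        return abs(noisy_count - true_count) / max(true_count, sanity_bound)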
38 Experiments
- Average relative error vs. taxonomy tree fan-out
[Figure: results under B = 0.5, 0.75, 1.0]